Abstract


Nearly two dozen states now administer online exams. These tests have real consequences: their results 
feed into accountability systems, which have been used for more than a decade to hold schools and 
districts accountable for their students’ learning. We examine the rollout of computer-based testing in 
Massachusetts over 2 years to investigate test mode effects. Crucial to the study design is the fact that 
the state administers the same exam (PARCC) in online and offline formats each year during the 
transitional period. We find mode effects of about 0.10 standard deviations in math and 0.25 standard 
deviations in English language arts (ELA). Our preferred estimates of the effects of online testing in the 
second year of administration suggest that mode effects for second-time test takers were about one 
third as large as the first year in math and about half as large in ELA. There is little evidence of 
systematic variation in mode effects by student demographic groups, although on ELA tests they are
larger for students scoring at the bottom of the achievement distribution.


1. Introduction 

Computer-based testing is rapidly spreading across the assessment landscape. There are 
significant advantages of using computer-based tests (CBTs) to assess student achievement, such as more 
flexibility in test item design, access to a large repository of items, and faster turnarounds for score receipt 
(Parhizgar, 2012). In addition, computer literacy is essential in the workplace, for both finding jobs and 
succeeding at them (Murray, Sherburn, & Pérez, 2007). Not surprisingly, many national and international 
assessments are in the process of moving online or have already done so, including the ACT, PISA, and 
NAEP.¹ The two consortia of Common Core-based tests, Partnership for Assessment of Readiness for
College and Careers (PARCC) and Smarter Balanced, are also in the process of transitioning to online
testing.² And as of 2016, nearly two dozen states administer K-12 state assessments online (Farmer,
2016). 

Anecdotal reports from the initial implementation of PARCC in several states have revealed 
lower scores for students who take PARCC online.³ Examples abound: in Ohio, a survey of districts
found that 85% of districts administering PARCC on paper received an “A” grade, compared to only 17%
of online districts.⁴ In Illinois, 43% of students who took PARCC on paper scored proficient or above in
ELA, compared to 36% of students who took the test online. In Maryland, middle school ELA students in
Baltimore County scored worse on the online test after controlling for student background and prior
achievement.⁵

In this paper, we use the rollout of online PARCC in Massachusetts to investigate whether these 
findings are reflective of real differences in student achievement across districts or of test mode effects. In
2015, some districts began transitioning to the PARCC assessment. These districts had the choice of using


1 For convenience, we use CBTs and online tests interchangeably throughout.

2 http://parcconline.org/assessments/test-design/whats-different-this-year

3 PARCC requires that schools have an Internet connection to administer the test:
http://www.parcconline.org/files/72/Technology%20Guidelines%20for%20PARCC%20Assessments/389/TechnologyGuidelinesPARCCAssessments-v5_0-Jan2016.pdf

4 O’Donnell, P. (2016, March 7). School districts got A grades on paper tests, but F grades online, survey shows. 
Cleveland.com. 

5 Herold, Benjamin. (2016, February 3). PARCC Scores Lower for Students Who Took Exams on Computers.

the paper or online version of the test, and nearly half administered the online format in 2015 or 2016.⁶
Crucially, the PARCC is not computer adaptive and many of the test items are similar in both the online 
and offline formats. Although some other states provided schools or districts with the option of taking the 
PARCC assessment on paper forms in the first years of implementation, the percentage of students taking 
the test on paper is lower in some states and has diminished over time (Educational Testing Service et al., 
2016; Pearson, 2017). Massachusetts’s experience administering the test therefore provides an 
opportunity to measure mode effects for a representative population of students; however, the results may 
not necessarily generalize to other assessments or to its use with particular subpopulations of students. 
We investigate two potential concerns related to the transition to online testing. The first is 
whether students administered an online exam score systematically lower than if they had taken the test 
on paper in states that administer their test online to some students and on paper to others. We find strong 
evidence that this is the case. In particular, students taking the online version of PARCC scored about 
0.10 standard deviations lower in math and about 0.25 standard deviations lower in English language arts 
(ELA) than students taking the paper version of the test. After several specification checks, we conclude 
that these differences cannot be explained by selection or prior student achievement. For example, we 
conduct a placebo test using science exams — which are administered on paper for all schools, even those 
that switched to CBTs for math and ELA — and show that the schools with large declines in student 
achievement when administering online PARCC in math and ELA have no relative change in measured 
science achievement. Our estimates of mode effects in math and ELA represent extremely large changes 
in measured student learning: up to 5.4 months of learning in math and 11.0 months of learning in ELA in 
a 9-month school year. Our preferred estimates of the effects of online testing in the second year of
administration suggest that mode effects for second-time test takers were about one third as large as in the
first year in math and about half as large in ELA. Student familiarity with online testing explains part of


6 As described below, three districts allowed schools to make their own decisions about adopting online testing, so
our estimation strategy uses school-level variation in online testing rather than district-level variation.

the reduction in effect size in both subjects, and we find some evidence of general improvements in
student performance on the online ELA test. 

The second potential concern is whether states that have fully moved to online testing should 
expect heterogeneity in test mode effects based on student background, which could cause test mode 
effects to be conflated with true differences in student achievement across subgroups. For instance, there 
may be systematic differences in students’ comfort level with CBTs depending on their access to 
computers in the home and at school. Low-income and minority households are significantly less likely to
have high-speed internet connections (Richtel, 2012). Urban schools are also less likely to have computers
with Internet access, making it more difficult to implement or practice with CBTs (Garland, 2012). While 
we find little systematic evidence of variation in treatment effects by student demographic group in math, 
we find that ELA mode effects are stronger for students at the bottom of the achievement distribution, for 
English language learners, and for special education students. 

The differential between students’ online and paper-based performance could have important
consequences, not only for students themselves but also for their teachers, schools, and communities. For 
instance, standardized tests have a wide range of potential consequences for students, including 
identification for gifted and talented programs, consideration for special education programs, and being 
flagged for grade retention. Our findings on CBTs are also consequential beyond the impact on individual 
students in three important ways. First, in many states, the results of standardized assessments support 
teacher evaluation, school accountability determinations, student graduation, or the distribution of school 
resources. Second, families make residential location decisions in part based on the measured
performance of local schools (Black & Machin, 2011). Third, student test scores are frequently used as an
outcome by researchers. Many of the large states frequently used to conduct education research, such as
Michigan, Massachusetts, New York, and North Carolina, have recently administered paper tests and
CBTs simultaneously or plan to in the near future.⁷,⁸ The public K-12 systems in these four states alone
contained nearly 7 million students in 2014-15, or approximately 13 percent of public school children in
the United States.⁹ Failure to take into account mode effects may lead to biases in empirical analyses of
education policies, especially for analyses of school-wide programs.


2. Background 

In 2015 and 2016, PARCC administered paper and online versions of the assessment. The paper 
versions of the PARCC assessment were adapted from the online forms and used a similar set of items. 
The online versions of the test included some interactive questions, and the paper and online versions 
were not exactly equivalent in any grade or year. However, both modes did include a subset of linked 
items to facilitate the reporting of student scores on a common scale (Educational Testing Service et al., 
2016; Pearson, 2017). Following the administration of the test, PARCC scored the tests for each mode 
separately and then transformed results from the paper tests onto the online scale using results from the 
common set of linked items. The scores were therefore intended to be comparable across modes. 

Nonetheless, news reports have documented higher scores for students taking paper tests than for
those taking online tests in several PARCC states (Brown, 2016; Farmer, 2016; O’Donnell, 2016; Tuoti,
2014). Despite the significant difference in measured achievement across test modes, only a few studies


7 Across all PARCC states, about 80 and 90 percent of students took the assessment online in 2015 and 2016, 
respectively (Educational Testing Service et al., 2016; Pearson, 2017). In Michigan, about 80 percent of schools 
administered assessments online in 2015 and 95 percent in 2016 (Michigan Department of Education, 2016). New 
York State began transitioning to online testing in 2017. Participation in online testing is currently voluntary and 
schools can elect to administer tests for individual grades and subjects online. The state plans to have all schools 
using CBTs by 2020 (New York State Education Department, 2016, 2017). North Carolina has been transitioning its 
tests online since 2015. Starting in 2018, the grade 3-8 assessments will all be available in both online and paper 
formats (North Carolina Department of Public Instruction, 2017). Other states, such as Florida, have administered a 
single mode for each assessment, but have rolled out online tests in different grades over multiple years (Florida 
Department of Education, 2017). 

8 These states are frequently sites of education research. Based on a search of papers using the I2 series JEL codes 
published in American Economic Review, Quarterly Journal of Economics, or Journal of Public Economics between 
2013 and 2017, of those listing a state in the abstract, 26% (5/19) reported one of these four states. 

9 Digest of Education Statistics, 2016. Table 203.40. https://nces.ed.gov/programs/digest/d16/tables/dt16_203.40.asp

have attempted to estimate mode effects on new tests aligned with the Common Core. The PARCC
consortium conducted a study based on results from field tests and concluded that “there is substantial 
evidence indicating that the differences in comparability across mode are relatively minor” (Brown et al., 
2015, p. 71), although mode effects were larger in writing tasks (Pearson, 2017). On the other hand, 
Duque (2017) found mode effects of between 0.05 and 0.44 standard deviations in favor of students 
taking the paper version of the PARCC in Baltimore County Public Schools. Assessing the research 
literature on other computerized tests at the K-12 level, Wang et al. (2007) conclude that the average 
study finds that students taking a paper test score about 10% of a standard deviation higher than those 
taking a CBT. 

One possible explanation for the reported mode effects is that they reflect true differences in
achievement across districts. For instance, if lower-achieving districts were more likely to switch to 
CBTs, we would expect their subsequent performance on CBTs to be lower even in the absence of a 
mode effect. Another potential reason for the mode effects observed in the studies mentioned above is 
temporary adjustment to a new test format. PARCC offers this explanation, with chief of assessment Jeff 
Nellhaus attributing some of the difference to “student familiarity with the [CBT] platform” (Brown, 
2016). If this were the case, mode effects would lower all students’ achievement and dissipate over time,
and states and districts should exercise caution in using transition-year scores in accountability
systems but could expect the issue to eventually correct itself. Although average achievement may not be
strictly comparable across years and should be expected to rise as students become accustomed to the 
tests, achievement gaps and accountability measures would have a consistent interpretation. On the other 
hand, if mode effects differ by student characteristics and do not fade over time, then switching to online 
tests could affect school accountability ratings or demographic achievement gaps in unexpected ways. 

Several empirical findings suggest that CBT mode effects may depend on student or school 
characteristics. First, in contrast to paper-and-pencil tests, there may be meaningful differences across 
schools in the instruments used to complete CBTs. Some schools have limited access to fully updated 
computers or the high-speed Internet access required for the online testing platform. For instance, urban 
schools have fewer computers per student and their students are less likely to report using computers
frequently for school work (Snyder et al., 2018). Differences in computing facilities may matter because 
several studies have found that technological aspects of the CBTs may affect student performance. Davis 
et al. (2013) reported that students found testing devices with larger screens easier to read. And, in an 
analysis for the PARCC consortium, Steedle et al. (2016) found evidence of differences in performance 
on online tests across different types of devices in at least one state. 

Second, CBTs may also measure skills, such as computer literacy, for which student proficiency 
differs. The online versions of the PARCC use web design features that might be unfamiliar to students 
who have less prior exposure to computers or the Internet. In Figure 1, we display reading passages from 
the sample PARCC assessment’s paper and online formats. The paper version of the test (Figure 1a) 
displays reading passages across multiple pages in the test booklet. On the other hand, the online version 
(Figure 1b) displays the full passage in a box embedded in a single page with multiple-choice questions. 
Students are expected to scroll down in the text box to view the complete passage. 

The online version also uses multiple display formats for multiple-choice questions, as shown in 
Figure 2. The paper version (Figure 2a) asks which answer best describes what the picture adds to the 
story and then provides several possible responses labeled with a corresponding entry on the answer 
sheet. Although most questions use a familiar radio button display for the multiple-choice questions, the 
assessments also use other formats for some questions. For the question in Figure 2, the online version 
(Figure 2b) includes an additional question (“What is Happening?”) and then asks the student to identify 
what the passage adds and presents three options in a dropdown menu. The online assessment also uses 
drag-and-drop and other interactive response methods. 

Finally, the two versions of the test use different structures for free response questions. In Figure 
3, we show the formats for an essay question in which students respond to a reading passage. The paper 
assessment (Figure 3a) includes two ruled sheets for students to write out their responses. For the online 
version, students type their responses in a text box located next to the reading prompt (Figure 3b). The 
text box includes some basic editing, formatting, and spellcheck features. The box also expands to 
accommodate students’ responses, although the amount of space displayed on the page is less than the
corresponding space in the paper version. 

If students have differing familiarity with these features, their accuracy in recording responses 
may differ. Although Massachusetts is a relatively wealthy state, computer penetration lags behind in 
some areas, especially for minority households. Of the 52 Public Use Microdata Areas (PUMAs) in the 
state, 19 report household computer access upwards of 95%.¹⁰ However, in several regions, fewer than
80% of households have access, and in many of those, fewer than 70% of Black and Hispanic households 
have computer access. In the PUMA containing Springfield, an area with a heavy minority population 
(non-Hispanic Whites make up only 37% of the population) and the second-largest school district in the 
state, about 35% of Hispanics and 28% of Blacks report no computer access at home. Disparities in 
computer access may moderate mode effects on standardized tests. Studying an online version of the 
NAEP writing test, White et al. (2015) found that home Internet access was associated with response 
length, use of editing tools, and preference for the computer test. Perhaps as a consequence, high-ability
students tend to score disproportionately high on CBTs (Clariana & Wallace, 2002; White et al., 2015).


3. Setting and Sample 

Massachusetts adopted new state curriculum frameworks incorporating the Common Core State 
Standards in 2011, with implementation beginning in the 2012-13 school year. Until 2014, all districts 
used the Massachusetts Comprehensive Assessment System (MCAS), which was administered on paper. 
Beginning in 2015, districts chose between MCAS and the new PARCC assessment.¹¹ It is important to
note that Massachusetts implemented a hold-harmless provision for all schools administering the PARCC
assessment in 2015 and 2016, whether online or on paper. During this period, no school’s accountability


10 Authors’ calculations using data from the 2013 and 2014 American Community Survey (Ruggles et al., 2015).

11 Except for Boston, Worcester, and Springfield, which had the option of assigning individual schools to the online
or paper format, districts selected a single test administration for the entire district. In November 2015, the 
Massachusetts State Board of Education voted to discontinue the PARCC assessment and implement a redeveloped 
version of the MCAS in all schools beginning in 2017. 


rating could fall as a consequence of their PARCC scores. As shown in Table 1, about 72% of elementary 
or middle schools in our sample administered the PARCC in either 2015 or 2016. PARCC districts had 
the additional option of offering the test online or on paper. Of those schools administering the PARCC in 
either 2015 or 2016, 57% administered the test online at least once. 

In this study, we use student achievement data for students in Massachusetts public schools 
during the 2011-2016 school years in Grades 3 through 8 (as described below, some of our models will
restrict our sample to Grades 5 through 8). We use longitudinal student achievement data that has been 
linked to student data in the Student Information Management System (SIMS) by the state, which 
includes information on students’ enrollment status, demographics, and program participation. 

We make several sample restrictions to properly identify student test mode. First, we limit our 
sample to schools that administered the PARCC in both 2015 and 2016 to ensure that achievement is 
measured on a common scale in each year.¹² Because we implement difference-in-differences designs and
use lagged school outcomes as key regressors, we also exclude schools that were not operating
continuously between 2011 and 2016. Our final sample includes about half of all students enrolled in
Grades 3 through 8 between 2011 and 2016 and 88 percent of students in schools administering the
PARCC in 2015 and 2016.¹³

We depict achievement trends graphically in Figure 4, which plots mean standardized test scores 
by year for schools switching to the PARCC assessment in 2015 based on their mode decisions. This 
figure shows the first suggestion of mode effects: although test scores are higher in the online districts in 
2011-2014, the schools that switch to online testing in 2015 then see their achievement fall behind the

12 Districts that switched to PARCC in 2015 could not switch back to MCAS in 2016. We also omit 3,229
observations for students in schools where more than 5 percent of students have a test mode that does not match the 
typical choice in their school. Massachusetts translated PARCC scale scores to equivalent MCAS scale scores 
(Massachusetts Department of Elementary and Secondary Education, 2016). Given the significant differences 
between the MCAS and PARCC schools in terms of student observables (Table 2), we do not use the rescaled scores 
in this analysis. Before 2015, we use the MCAS scores standardized within the set of PARCC schools that comprise 
this sample. In 2015 and 2016, we similarly standardize the PARCC scores. The standard deviation of test scores in 
this sample is between 0.96 and 1.01 times the standard deviation measured in the full sample in each grade, subject, and
year, so this standardization does not materially affect the coefficient estimates presented in this paper. 


13 In addition to the omission of schools that opened during this period, students in PARCC schools can be missing from the sample
for reasons such as taking an alternate version of the assessment or opting out of the assessment.

schools that remained on paper.¹⁴ We display summary statistics in Table 2, dividing the sample between
the three test modes: MCAS, PARCC online, and PARCC paper. Each sample contains hundreds of 
unique schools and hundreds of thousands of student-year observations. Schools administering the 
PARCC exam on paper tend to have lower test scores and more low-income students than those 
administering the PARCC online. 

Consistent with Figure 4, average MCAS achievement in online districts prior to the 
implementation of PARCC is about 0.09 standard deviations higher in math and 0.11 standard deviations 
higher in ELA than in paper districts. Although prior test scores are higher in the online districts, current 
test scores are 0.12 standard deviations lower in ELA and 0.02 standard deviations lower in math. 
Regarding differences in baseline characteristics, 44% of students taking the paper test and 37% of 
students taking the online test qualify for subsidized lunches. In the next section, we discuss our strategies
for accounting for the clear differences in baseline characteristics.
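
The apparent rise in paper-school scores in Figure 4 follows from standardizing scores within subject, grade, and year. As a minimal sketch of that mechanism, the following uses hypothetical raw scores (the values and the 10-point drop are illustrative assumptions, not figures from the data) to show that lowering only the online scores raises the standardized mean of the paper group.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Z-score a list of scores to mean 0 and standard deviation 1."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]

def group_means(paper, online):
    """Standardize the pooled scores, then return (paper mean, online mean)."""
    z = standardize(paper + online)
    return mean(z[:len(paper)]), mean(z[len(paper):])

# Hypothetical raw scores for one subject, grade, and year.
paper_raw = [40.0, 50.0, 60.0]
online_raw = [45.0, 55.0, 65.0]

before = group_means(paper_raw, online_raw)
# Assume a mode effect lowers every online score by 10 raw points,
# while the paper scores are unchanged.
after = group_means(paper_raw, [s - 10.0 for s in online_raw])

print("paper, online before:", before)
print("paper, online after: ", after)
```

Because the pooled scores are re-centered at zero each year, the paper group's standardized mean moves up even though its raw scores did not change.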


4. Empirical Strategy 

Assessments are intended to measure how much a given student knows. We conceptualize test
mode effects by writing student $i$'s test score at time $t$ in school $s$ given test mode $m$ as a function of
knowledge, a mode-specific component, and noise:

$$A^{m}_{ist} = k_{ist} + c^{m}_{ist} + \varepsilon^{m}_{ist}.$$

In the preceding, measured achievement depends on a student’s true knowledge, $k_{ist}$, which is constant
regardless of whether a test is administered online or on paper; a student-specific test mode effect, $c^{m}_{ist}$;
and a mean zero error term. We refer to the online test mode effect as representing generic “computer-specific
ability,” $c^{o}_{ist}$, which may be positive or negative for CBT-takers and is zero for paper-takers and


14 The increase in the scores of paper schools when online testing was introduced in the online schools is a mechanical
consequence of test scores being standardized to have mean zero within subject, grade, and year. Because mean
scores in the online schools fall substantially, mean scores in the paper schools must rise in order for test scores to
remain centered at zero. We show an alternate version of Figure 4 as Appendix Figure 1 where we use scale scores
(transformed onto the MCAS scale in the case of the PARCC assessment) instead of normed scores.

may reflect school factors (e.g., availability of sufficient computers, quality of Internet connection at the
school) or the quality of the test design. Thus, relative to paper tests, the properties of CBT-measured
achievement can vary because of differences in $c^{m}_{ist}$ across modes.

For each student in the sample, we only observe achievement on one version of the test. That is,
we measure

$$A_{ist} = k_{ist} + c^{o}_{ist} \times CBT_{st} + \eta_{ist},$$

where $\eta_{ist} = \varepsilon^{o}_{ist} \times CBT_{st} + \varepsilon^{p}_{ist} \times (1 - CBT_{st})$ is the shock on the given exam, with superscripts $o$ and $p$ denoting the online and paper modes. The key empirical
challenge is that student ability $k_{ist}$ may be correlated with test mode status. The summary statistics in
Table 2 suggest this is the case, and below we describe how we account for these differences in ability. 
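
To illustrate why this correlation matters, the following is a minimal simulation sketch of the measurement model above. All parameter values (a mode effect of -0.25, a knowledge gap of 0.10 in favor of online takers, and the noise scale) are assumptions chosen for illustration, not estimates from the data.

```python
import random
from statistics import mean

def simulate(n_per_group=10000, mode_effect=-0.25, knowledge_gap=0.10, seed=0):
    """Simulate A = k + c * CBT + eta and return the naive online-paper gap."""
    rng = random.Random(seed)
    online, paper = [], []
    for _ in range(n_per_group):
        # Online takers: higher average knowledge, plus the mode effect.
        k = rng.gauss(knowledge_gap, 1.0)
        online.append(k + mode_effect + rng.gauss(0.0, 0.5))
        # Paper takers: the mode effect is zero by construction.
        k = rng.gauss(0.0, 1.0)
        paper.append(k + rng.gauss(0.0, 0.5))
    return mean(online) - mean(paper)

naive_gap = simulate()
print(f"assumed mode effect: -0.25, naive gap: {naive_gap:.3f}")
```

Under these assumptions, the raw difference in means mixes the mode effect with the knowledge gap and therefore understates the penalty from online testing, which is why the comparison must adjust for differences in ability.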

If students who took the paper version of PARCC tend to do better, even controlling for prior test 
scores and demographic factors, there are several potential explanations: (1) The
paper version may be generically “easier” in some sense, meaning that, on average, a given student would 
be expected to score higher on the paper test than on the computer test, perhaps due to factors such as 
ease of reviewing and revising previous responses (Wang et al., 2007). (2) The paper version may be less 
discriminating among students of differing academic ability so that small preexisting differences in test 
scores across participating and nonparticipating districts become magnified when switching to online
testing (similar to the argument in Cascio & Staiger [2012], who show that increasing variance of
knowledge in later grades partially explains why estimated treatment effects of interventions are smaller
in later grades). Or (3) unobservable student or teacher characteristics differ in districts that choose the
paper and online versions of the test. The latter may be the case, for instance, if average teacher quality is 
higher in the districts that chose the paper test. 

We take two empirical approaches to estimating mode effects. First, we use data on student test 
scores from before the introduction of the online PARCC assessment to adjust for differences in student
background. In particular, we estimate

$$A_{ist} = \lambda A_{is,t-2} + \alpha X_{ist} + \beta CBT_{st} + \gamma_{t} + \eta_{ist}. \qquad (1)$$

In Equation (1) above, $i$ indexes students, $s$ indexes schools, and $t$ indexes years. We regress the test score
in year $t$ on twice-lagged test scores $A_{is,t-2}$, demographic characteristics $X_{ist}$, and an indicator for whether
the school administers a CBT in year $t$, $CBT_{st}$.¹⁵ $X_{ist}$ also includes a set of school variables, including
aggregates of the student-level variables and the full history of school average achievement in math and
ELA between 2011 and 2014. Because we have 2 years of PARCC data, all lagged achievement variables
in Equation (1) are measured before the introduction of online testing to avoid having online scores as
both an outcome and control variable in 2016. In addition, Equation (1) is estimated on a sample of
students in Grades 5-8 because of the need for twice-lagged scores. The coefficient $\beta$ then measures the
average loss (or gain) in test scores associated with taking a CBT relative to observationally similar 
students who took a paper test. 
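
A stripped-down version of the lagged-score regression in Equation (1) can be sketched as follows. This is an illustration under stated assumptions rather than our estimation code: it omits the demographic controls and year effects, uses six hypothetical students, generates noise-free outcomes from assumed coefficient values, and relies on a small hand-written `ols` helper.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

# Hypothetical students: (twice-lagged score, CBT indicator).
students = [(-1.0, 0), (-0.5, 1), (0.0, 0), (0.3, 1), (0.8, 0), (1.2, 1)]
LAMBDA, BETA, INTERCEPT = 0.7, -0.10, 0.05  # assumed values, for illustration

# Regressors: constant, twice-lagged score, CBT indicator.
X = [[1.0, lag, float(cbt)] for lag, cbt in students]
y = [INTERCEPT + LAMBDA * lag + BETA * cbt for lag, cbt in students]

intercept_hat, lambda_hat, beta_hat = ols(X, y)
print(f"lambda = {lambda_hat:.3f}, beta = {beta_hat:.3f}")
```

Because the outcomes here contain no noise, the regression recovers the assumed coefficients; with real data, the coefficient on the CBT indicator is the estimate of interest.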

Second, we use data from before the introduction of online testing and assess mode effects using 
a difference-in-differences design. Specifically, we include school fixed effects to remove any 
unobserved, time-invariant differences across districts: 

$$A_{ist} = \alpha X_{ist} + \beta CBT_{st} + \chi_{s} + \gamma_{t} + \eta_{ist}. \qquad (2)$$

The advantage of this approach relative to Equation (1) above is that the test mode effect $\beta$ is estimated
from within-school variation over time. This allows for the possibility of time-invariant preexisting
differential factors such as teacher or school quality in paper versus online districts. In addition, because
we include school fixed effects in place of students’ prior achievement, we can include students in third
and fourth grade and thus use the full sample of Grades 3-8. As before, the coefficient $\beta$ then represents


15 We use official scale scores standardized by year and grade for both the current and lagged achievement variables.
Massachusetts transforms the individual student ability estimates into MCAS scale scores using linear 
transformations that differ across performance categories (Massachusetts Department of Elementary and Secondary 
Education, 2015). To ensure that the results are not sensitive to using control variables with a potentially non- 
interval interpretation, we also use a normal curve equivalent transformation of the lagged MCAS achievement 
variables. None of the results shown is sensitive to using these variables in place of the standardized scores. In 
addition, all the empirical analyses estimate standard errors that are robust to arbitrary heteroskedasticity in the error 
term.

the additional gain (or loss) associated with taking the test online, holding these time-invariant differences
across schools constant.¹⁶
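
The within-school logic of Equation (2) can be sketched on a small balanced panel. The example below is a simplified illustration, not our estimation code: it works with hypothetical school-by-year means, omits the covariates, and assumes a homogeneous mode effect of -0.11. For a balanced panel, demeaning by school and by year and then regressing the demeaned outcome on the demeaned CBT indicator yields the two-way fixed effects coefficient.

```python
from statistics import mean

def twfe_beta(y, cbt):
    """Two-way within estimator for a balanced school-by-year panel.

    y[s][t] and cbt[s][t] hold the outcome and CBT indicator for school s
    in year t. Both are demeaned by school and by year before regressing.
    """
    def demean(m):
        row = [mean(r) for r in m]
        col = [mean(c) for c in zip(*m)]
        grand = mean(row)
        return [[m[s][t] - row[s] - col[t] + grand
                 for t in range(len(col))] for s in range(len(row))]
    yd, cd = demean(y), demean(cbt)
    num = sum(a * b for ry, rc in zip(yd, cd) for a, b in zip(ry, rc))
    den = sum(b * b for rc in cd for b in rc)
    return num / den

# Hypothetical panel: 3 schools observed in 2013-2016.
school_fx = [0.20, -0.10, 0.05]
year_fx = [0.00, 0.02, -0.01, 0.03]
cbt = [[0, 0, 1, 1],   # switches online in 2015
       [0, 0, 0, 1],   # switches online in 2016
       [0, 0, 0, 0]]   # stays on paper
BETA = -0.11  # assumed mode effect, for illustration
y = [[school_fx[s] + year_fx[t] + BETA * cbt[s][t] for t in range(4)]
     for s in range(3)]

beta_hat = twfe_beta(y, cbt)
print(f"beta = {beta_hat:.3f}")
```

The school effects differ widely across the three hypothetical schools, yet they drop out of the estimate because only within-school changes over time identify the coefficient.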

Before moving to the empirical results, we present some preliminary checks on the validity of the 
research designs. We estimate the difference-in-differences models using three leads and one lag of the 
online testing indicator and plot the cumulative effects of online testing by year in Figure 5. We find little 
evidence of preexisting trends in either math or ELA. 
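
A check of this kind can be sketched in simplified form. The example below assumes, unlike our setting, that all online schools switch in the same year, so that the lead and lag effects reduce to the online-minus-paper gap in each year measured relative to the last pre-switch year. The school means are hypothetical values chosen for illustration.

```python
from statistics import mean

def event_study(y_treated, y_control, base):
    """Treated-minus-control gap in each period, relative to the base period.

    y_treated[s][t], y_control[s][t]: balanced panels of school mean scores.
    base: index of the reference period (the last year before online testing).
    """
    gap = [mean(col_t) - mean(col_c)
           for col_t, col_c in zip(zip(*y_treated), zip(*y_control))]
    return [g - gap[base] for g in gap]

years = [2012, 2013, 2014, 2015, 2016]
# Hypothetical school means: online schools start higher, then drop in 2015.
online = [[0.10, 0.12, 0.11, -0.02, 0.04],
          [0.20, 0.22, 0.21, 0.08, 0.14]]
paper = [[0.00, 0.02, 0.01, 0.00, 0.01],
         [-0.10, -0.08, -0.09, -0.10, -0.09]]

effects = event_study(online, paper, base=years.index(2014))
for yr, e in zip(years, effects):
    print(yr, round(e, 3))
```

In this constructed example the pre-2015 values are zero, which corresponds to parallel pre-trends, and the negative values in 2015 and 2016 correspond to the estimated effects after the switch.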

Beyond the usual concerns about selection on observables, the coincidence in the timing between 
the introduction of the PARCC assessment and online testing potentially creates an additional problem for 
estimation approaches that rely either on lagged scores, such as Eq. (1), or within-school variation in test 
scores, as in Eq. (2). In both instances, we use test scores from the MCAS to control for baseline 
achievement, whereas online testing results are generally available only on the PARCC. However, 
because the 2011 Massachusetts Curriculum Framework (aligned with MCAS) is very similar to the
Common Core State Standards (aligned with PARCC), this may not be a concern.¹⁷


5. Results 
5.1 Main results 

We begin by displaying our main results in Table 3. In Columns (1) and (4), we regress test 
scores in year $t$ on twice-lagged test scores, demographic information, and means of each at the school
and school-grade-year levels. The achievement outcomes are measured in 2015 and 2016, whereas the 
lagged achievement measures are derived from MCAS tests administered in 2013 and 2014. Recall that 
we use twice-lagged test scores to avoid having online scores as both an outcome and control variable: the 
test scores used as controls are always from MCAS paper assessments. In the remaining columns, we
replace the individual- and school-level student achievement controls and estimate difference-in-differences

16 As shown below, we also experiment with using student fixed effects instead of school fixed effects. Results are
very similar.

17 “The Massachusetts Working Draft Standards overlap with about 90% of the Common Core.”
http://www.doe.mass.edu/bese/docs/fy2011/2010-07/item1.html

models. In Columns (2) and (5), we use the same grades as the ordinary least squares (OLS)
models for comparison. We then include data from the full set of tested grades (3-8) between 2011 and 
2016. 

The results are generally similar across models. We estimate mode effects of -0.10 in math and 
-0.24 in ELA controlling for prior test scores. When we estimate mode effects on the same grades with 
difference-in-differences models, we estimate mode effects of -0.10 and -0.25, respectively. Our preferred 
estimates in Columns (3) and (6), which include data from all grades, are quite similar: -0.11 standard 
deviations in math and -0.24 standard deviations in ELA. The results of Table 3 indicate that students 
taking PARCC online score lower than observationally similar students who take the test on paper. 
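
The logic of the difference-in-differences comparison can be sketched in its simplest two-group, two-period form. 
The snippet below is only an illustration under stated assumptions: the function name `did_estimate`, the record 
layout, and the toy scores are hypothetical, and the sketch omits the school-by-grade fixed effects, grade-by-year 
indicators, and covariates used in the specifications reported in Table 3. 

```python
from statistics import mean


def did_estimate(records):
    """Two-by-two difference-in-differences from (online_school, post, score) records.

    online_school: 1 if the school switches to online testing, else 0.
    post: 1 for the year(s) after the switch, else 0.
    """
    cells = {(g, p): [] for g in (0, 1) for p in (0, 1)}
    for online_school, post, score in records:
        cells[(int(online_school), int(post))].append(score)
    avg = {cell: mean(scores) for cell, scores in cells.items()}
    change_online = avg[(1, 1)] - avg[(1, 0)]  # change in schools that go online
    change_paper = avg[(0, 1)] - avg[(0, 0)]   # change in schools that stay on paper
    return change_online - change_paper


# Made-up standardized scores, for illustration only (not the paper's data).
toy_records = [
    (0, 0, 0.00), (0, 0, 0.10), (0, 1, 0.02), (0, 1, 0.12),   # paper schools
    (1, 0, 0.05), (1, 0, 0.15), (1, 1, -0.03), (1, 1, 0.07),  # online schools
]
print(round(did_estimate(toy_records), 2))  # -0.1
```

In this toy example, scores in the online schools fall by 0.08 while scores in the paper schools rise by 0.02, so 
the estimated mode effect is the difference between the two changes. 
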
5.2 Robustness checks 

In this section, we investigate whether our estimates of test mode effects could be driven by 
unrelated changes in school achievement. In Table 4, we therefore conduct several further tests of the 
main research design. To test whether preexisting trends in school outcomes explain our estimated mode 
effects, we conduct a placebo test on science achievement in Grades 5 and 8, which was administered on 
paper forms throughout this period. We should therefore expect to find null results on these placebo tests. 
In Columns (1) and (2), we replicate the OLS and difference-in-differences regressions in Table 3 using 
the science achievement scores as a dependent variable. In each case, the set of explanatory variables is 
identical to those estimated in previous regressions. We find no evidence of mode effects in either 
specification: The estimates are near zero (-0.005 and -0.001, respectively) and statistically 
insignificant.¹⁸ 

18 We also estimate difference-in-differences models that include school-specific linear time trends. The results 
are quite similar to the main difference-in-differences estimates, although we note that Figure 5 does not suggest 
clear evidence of preexisting trends among the online adopters. 

In the remaining columns, we use different samples of schools to further test the robustness of our 
results. First, although we do not find evidence of preexisting trends in online PARCC schools relative to 
paper PARCC schools, the online schools may be differentially effective at teaching the academic skills 
tested by the PARCC assessments. Although the Massachusetts and Common Core standards were 
similar, we may conflate test mode and test type effects because most schools switched test modes at the 
same time they introduced the PARCC assessment. In columns (3) and (6), we restrict our sample to 2015 
and 2016 and re-estimate the DID models. These models only use variation in test format generated by 
schools that switched modes between 2015 and 2016. Because all schools in the sample administer the 
PARCC during this window, our estimates should not be sensitive to changes in the test content. The 
estimates are quite similar to those in Table 3: we find that switching to an online test mode reduces 
achievement by 0.12 standard deviations in math and 0.27 standard deviations in ELA. 

Next, we broaden the sample to include schools switching to the PARCC in 2016 (schools in 
rows 2 and 3 of Table 1). Most of these schools switched to the paper test, and their prior average 
achievement is significantly lower than schools administering the PARCC online. Nonetheless, their 
inclusion does little to change the estimated mode effects in either math or ELA. Finally, the introduction 
of online testing may affect which students take the standardized tests. Students’ opting out of 
standardized testing was widely reported in some states switching to Common Core aligned tests (Ujifusa, 
2015), and the movement to online tests may have been a more salient signal of changes in testing policy. 
To assuage concerns that changes in the composition of test takers explain our results, we replace 
school-by-grade fixed effects with student fixed effects in Columns (5) and (8), with very similar results 
to our main results in Table 3.¹⁹ We conclude that sorting into test mode based on preexisting trends is 
unlikely to drive our findings. 

19 In addition, we directly test for effects of online testing on the likelihood that enrolled students had valid testing 
scores in either math or ELA, replicating the DID specification from Table 3. The coefficient on online testing is 
0.001 and statistically insignificant. 

5.3 Dynamic effects of computer-based testing 

Over time, schools or districts may improve their ability to administer online tests, or students 
may become more familiar with the testing software. We therefore test whether mode effects differ on 
subsequent administrations of the test. We replicate the results in Table 3 using indicators for the number 
of cumulative administrations of online PARCC tests. For the remainder of the paper, we use the DID 
specification used in Columns (3) and (6) in Table 3 so that we can use the entire Grade 3-8 sample.²⁰ 
We measure prior exposure to online tests in three ways. First, we measure the number of times a school 
has previously administered the PARCC test online. Second, we measure the number of times a cohort 
has taken the online PARCC. This is generally the same as the school-level measure, except for students 
in third grade who have not previously taken the PARCC. That is, for schools that administered PARCC 
online in 2015, fourth graders and above in 2016 have taken the PARCC online previously, while those in 
third grade in 2016 have not. Finally, we measure the number of times an individual student has taken the 
PARCC online. This differs from the two measures above for students who switch to an online school 
from a paper school in 2016 (or vice versa). We use these sources of variation to separately identify the 
effects of student familiarity and school implementation. Coefficients on subsequent online tests 
measured at the individual level that are significantly smaller than the first time would be taken as 
evidence of student-specific fade out, while coefficients on the school level measure that are significantly 
smaller would suggest that district or school test coordination improves over time. 
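
The three exposure measures described above can be made concrete with a short sketch. Everything named here is 
hypothetical and chosen for illustration: the function `prior_online_exposure`, the dictionary layout for school 
test modes, and the assumption that a cohort is followed within the student's current school and that Grade 3 is 
the first tested grade. 

```python
def prior_online_exposure(school_online, student_history, school, grade, year,
                          first_tested_grade=3):
    """Count earlier online administrations at the school, cohort, and student level.

    school_online: {school: {year: True if the school tested online that year}}
    student_history: list of (year, took_test_online) pairs for one student
    Returns (school_prior, cohort_prior, student_prior).
    """
    earlier = {y: on for y, on in school_online.get(school, {}).items() if y < year}
    # School level: how often the current school has already tested online.
    school_prior = sum(1 for on in earlier.values() if on)
    # Cohort level: same, but only years in which this cohort was in a tested grade.
    cohort_prior = sum(
        1 for y, on in earlier.items()
        if on and grade - (year - y) >= first_tested_grade
    )
    # Student level: how often this particular student has tested online.
    student_prior = sum(1 for y, on in student_history if y < year and on)
    return school_prior, cohort_prior, student_prior


# Hypothetical schools: A is online in both years, B moves from paper to online.
modes = {"A": {2015: True, 2016: True}, "B": {2015: False, 2016: True}}

stayer = prior_online_exposure(modes, [(2015, True)], "A", grade=4, year=2016)
third_grader = prior_online_exposure(modes, [], "A", grade=3, year=2016)
switcher = prior_online_exposure(modes, [(2015, False)], "A", grade=4, year=2016)
print(stayer, third_grader, switcher)  # (1, 1, 1) (1, 0, 0) (1, 1, 0)
```

The three example students show where the measures diverge: the third grader separates the cohort measure from the 
school measure, and the student who arrives from a paper school separates the student measure from the other two. 
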

We present the results of this exercise in Table 5. In each column, the coefficient on the online 
indicator is presented first and the coefficient for a set of interaction terms is presented second. In each 
case, we also include interactions of the mode effect with grade. We focus on the interaction effects on 
the number of school and student administrations. In Columns (1) and (4), we estimate the effects of 
online testing in the first and second year a school administers the exam. The coefficient on the online 
indicator provides an estimate of the first-time mode effect for third grade students and the interaction 
with second year provides an estimate of the change in the mode effect during the second year of 
administration. In math, we estimate that the mode effect diminishes by about 0.06 standard deviations 
between the first and second year. In ELA, the mode effect diminishes by about 0.11 standard 
deviations.²¹ 

20 Results are similar when using OLS models. 

In the next columns, we attempt to disentangle the effects of student familiarity from schoolwide 
improvements in administration by including controls for prior student experience with PARCC. As we 
discussed above, these coefficients describe slightly different sources of variation in the number of test 
administrations. In Columns (2) and (5), we add a control for whether students in a cohort have had a 
prior administration of the PARCC online. This differs from the number of prior school administrations 
only for students in third grade, who have not previously taken the PARCC in either 2015 or 2016 and 
therefore have no prior online PARCC experience. Thus, we compare changes in the mode effect in third 
grade to changes in the mode effect in fourth through eighth grades for students who have previously 
taken an online PARCC test. In Columns (3) and (6), we include an indicator for the second time a 
student takes a PARCC assessment. This variable differs from the school administration variable for 
students in third grade, but it also identifies the effect of prior experience with online PARCC from 
students who switch into online schools from paper schools in different years. 

We find some evidence that a reduction in mode effects results from both student familiarity with 
the online format and school experience administering the exam. In math, student familiarity explains the 
full reduction in the size of the mode effect between the first and second years. Once we adjust for the 
number of prior times a student has taken the exam (the coefficients on prior student administrations and 
prior grade administrations), we find little evidence of schoolwide reductions in the mode effect between 
the first and second years. In Column (2), we see no improvement in the mode effect for third graders 
between 2015 and 2016, which suggests that improvements in online test outcomes result from students 
becoming more familiar with the exam. Similarly, when we control directly for students’ prior number of 
online test administrations in Column (3), we find little evidence of schoolwide improvements from 2015 
to 2016. This suggests that student familiarity explains much of the change. 

21 The mode effects could also diminish over time if schools with less successful implementation of online testing 
were disproportionately likely to switch back to paper testing in 2016. We do find some evidence of this: first-year 
test mode effects were larger by 0.02 and 0.05 standard deviations in math and ELA, respectively, for schools that 
did not administer the test online in both years, although the interaction is only statistically significant in ELA. 
Accounting for this heterogeneity in mode effects reduces the improvement in second-year online scores in Columns 
(1) and (4) by about 0.01 standard deviations in math and 0.02 standard deviations in ELA. 

On the other hand, we do find evidence of general administrative effects on the ELA test. In 
Column (5), we see relatively large reductions in the mode effect among third graders (the coefficient on 
second year school), which is consistent with general schoolwide improvements in online test results. The 
reduction in the mode effect for older students is slightly larger than for third graders (coefficient on 
second year grade), but this difference is not statistically significant. However, when we control directly 
for the number of times a student has previously taken the PARCC, we do find more evidence of a 
familiarity effect. The coefficient on a student’s second administration is about 0.08 standard deviations, 
and including this measure reduces the schoolwide administration effect from 0.08 to 0.05 standard 
deviations. Overall, student familiarity with the PARCC assessment explains almost all of the reduction 
in the mode effect for the math test. Although student familiarity may also be important for explaining 
reductions in the mode effect for ELA, it appears there is also a role for general school improvements in 
test administration. Nonetheless, students taking online tests scored lower than those taking paper tests in 
both subjects during the second year of administration. 

5.4 Benchmarking effect sizes 

Using Table 5 in conjunction with the findings of Lipsey et al. (2012), we compute the size of 
the test mode effects in terms of months of learning. In particular, we use Lipsey et al.’s 
estimates of annual learning gains in a given grade and subject to translate the effect sizes in Columns (3) 
and (6) of Table 5 into months of learning given a 9-month school year.²² The effect size ranges from 1.4 
months (grade 3) to 5.4 months (grade 8) in math and 3.6 months (grade 3) to 11.0 months (grade 7) in 
ELA. 

22 An example of this calculation is 9*[(0.137+0.054) / 0.32], where 0.137+0.054 is the grade 8 math mode effect 
(Column 3 of Table 5) and 0.32 is Lipsey et al.’s (2012) estimate of the annual learning gain between spring of 
grade 7 and spring of grade 8. 
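
The conversion in footnote 22 can be reproduced with a few lines of arithmetic. The function name 
`months_of_learning` is a hypothetical helper introduced for illustration; the inputs are the grade 8 math values 
quoted in the footnote. 

```python
def months_of_learning(mode_effect_sd, annual_gain_sd, months_per_year=9):
    """Convert an effect size in standard deviations into months of learning."""
    if annual_gain_sd <= 0:
        raise ValueError("annual_gain_sd must be positive")
    return months_per_year * mode_effect_sd / annual_gain_sd


# Grade 8 math: mode effect 0.137 + 0.054, annual learning gain 0.32 (footnote 22).
grade8_math = months_of_learning(0.137 + 0.054, 0.32)
print(round(grade8_math, 1))  # 5.4
```
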
6. Heterogeneous Effects 
6.1 Heterogeneity in mode effects by student characteristics 

To assess the extent to which certain subgroups are disproportionately impacted by the shift to 
online testing, we re-estimate Eq. (1) separately for the demographic groups identified in the 
administrative data. We show the results in Table 6, which indicates relatively little variation in mode 
effects by student demographic group. The main exceptions to this pattern are English language 
learners, who have significantly larger mode effects on ELA tests, and special education students, who 
score lower on online tests than other students in both subjects. 
6.2 Quantile treatment effects 

We assess the distributional effects of CBTs by estimating quantile treatment effects (QTE) for 
the online test. The QTEs describe the effect of online testing on given quantiles of the test score 
distribution and provide an indication of whether the format of the test affects students differentially by 
their academic aptitude. We estimate the QTE using the two-step method suggested by Firpo (2007). In 
the first step, we estimate a propensity score for test mode using the same set of covariates in the lag score 
model in Eq. (1). Identification therefore rests on the same selection-on-observables assumption as our 
primary specification. Recall that this set of covariates includes lags of both school and student 
outcomes.²³ In the second step, we construct differences in the counterfactual distributions of test scores 
at each vigintile by quantile regression using the propensity scores as weights. The QTE estimator is 
similar to the more familiar propensity score weighting estimators for mean treatment effects; the primary 
difference is that the model replaces the weighted difference in outcomes between treated and control 
units in the second step with a difference in sample quantiles constructed using the propensity scores as 
weights. 

23 We use the Stata command ivqte to implement the estimation of QTE (Frölich & Melly, 2010), which uses a logit 
estimator for the propensity score. 

24 To avoid assigning large weights to individual observations, we trim the sample to include observations with 
propensity scores between 0.1 and 0.9 using the rule of thumb procedure suggested by Crump et al. (2009) for the 
average treatment effect; however, estimated effects are not sensitive to this choice. 
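
The second step of a propensity-score-weighted QTE estimator can be sketched as a difference in weighted sample 
quantiles. This is a simplified illustration, not the ivqte implementation: the function names are hypothetical, 
the propensity scores are taken as given rather than estimated by logit, and each weighted quantile is computed by 
inverting the weighted empirical distribution rather than by running a weighted quantile regression. Treated 
observations receive weight 1/p and control observations weight 1/(1-p), with trimming at 0.1 and 0.9 as in 
footnote 24. 

```python
def weighted_quantile(values, weights, tau):
    """Smallest value whose cumulative share of total weight reaches tau."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= tau * total:
            return value
    return pairs[-1][0]


def ipw_qte(scores, online, pscores, tau, lo=0.1, hi=0.9):
    """Difference in inverse-propensity-weighted quantiles, online minus paper."""
    y1, w1, y0, w0 = [], [], [], []
    for y, d, p in zip(scores, online, pscores):
        if not lo <= p <= hi:  # trim extreme propensity scores
            continue
        if d:
            y1.append(y)
            w1.append(1.0 / p)
        else:
            y0.append(y)
            w0.append(1.0 / (1.0 - p))
    return weighted_quantile(y1, w1, tau) - weighted_quantile(y0, w0, tau)


# Made-up data, for illustration only; the last observation is trimmed.
toy_scores = [-0.3, -0.1, 0.1, 0.0, 0.2, 0.4, 5.0]
toy_online = [1, 1, 1, 0, 0, 0, 1]
toy_pscores = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.95]
print(round(ipw_qte(toy_scores, toy_online, toy_pscores, tau=0.5), 2))  # -0.3
```
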
We display the estimated QTE for each vigintile of the test score distribution in Figure 6. In math, 
we find little evidence that the effects differ across the achievement distribution. The estimated QTE 
range from -0.06 at the 5th percentile of the test score distribution to -0.17 at the 75th percentile. 
However, 12 of the 19 estimates are between -0.08 and -0.13. We find more significant evidence of 
variation in the QTE on the ELA test. We estimate larger mode effects on the bottom of the achievement 
distribution, with QTE below the median ranging from -0.41 at the 10th percentile to -0.18 at the 65th 
percentile. Above the median, the estimated QTE are all between -0.18 and -0.25. Choices of test mode 
therefore appear to have relatively minor distributional consequences for the math test, but more 
significant effects on the distribution of ELA achievement. 


7. Discussion 

We find strong evidence that media reports of students scoring lower on CBTs represent true test 
mode effects that cannot be explained by preexisting trends in the performance of schools that initially 
moved to online testing or by the prior achievement of students who take the test online. The effect on 
ELA (-0.24 standard deviations) is larger than the effect on math (-0.10 standard deviations), but both are 
large changes in measured student performance that should concern education agencies using online 
PARCC scores for accountability purposes. In addition, the effects in ELA are most pronounced for 
students at the bottom of the test score distribution. 

Although some policymakers have questioned the consequential use of online assessments during 
the first year of implementation,²⁵ we also find test mode effects in the second year of online testing, 
especially in ELA. States or districts that administer PARCC online to some students and on paper to 
other students should be aware that the paper students will likely score systematically higher, even in the 
second year. 

25 For example, in response to the observed test mode effects in Baltimore County, Russell Brown, the district’s 
chief accountability and performance-management officer, said that “I think it draws into question the validity of the 
first year’s results for PARCC.” Brown, Emma. Report: Kids who took Common Core test online scored lower than 
those who used paper. The Washington Post, 4 February, 2016. 
https://www.washingtonpost.com/news/education/wp/2016/02/04/report-kids-who-took-common-core-test-online-scored-lower-than-those-who-used-paper 

Our findings indicate that policies that reward or sanction students, teachers, or schools based on 
student test scores should take test mode effects into account. In Massachusetts, for example, schools that 
chose to administer PARCC in 2015 and 2016 (whether online or on paper) were subject to a “hold 
harmless” provision, in which schools would be “held harmless for any negative changes in their [...] 
accountability and assistance levels”.²⁶ From 2017 forward, as the state transitions to a new assessment, it 
will use statistical adjustments to correct for mode effects.²⁷ Other states may want to consider similar 
strategies during the transition phase to online testing. 

26 Massachusetts Assessment Decision FAQs. http://www.doe.mass.edu/mcas/decisionfaq.html 
27 Information obtained through personal correspondence with the state. 
Figure 1. Reading passage display formats on online and paper assessments 

[Image: Grade 3 ELA/Literacy reading passage, “Johnny Chuck Finds the Best Thing in the World” by Thornton W. 
Burgess. Panels: (a) Paper Format; (b) Online Format.] 
Figure 2. Multiple-choice question display formats on online and paper assessments 

[Image: Part A question about what the picture adds to the story. Panels: (a) Paper Format; (b) Online Format, with 
drop-down menus labeled “What is Happening?” and “What the Picture Adds”.] 


Figure 3. Free-response question formats on online and paper assessments 

[Image: essay prompt asking how Old Mother West Wind’s and the Sandwitch’s words and actions are important to the 
plots of the stories. Panels: (a) Paper Format; (b) Online Format.] 
Figure 4. Standardized achievement by year 

[Image: line plots of mean standardized achievement by subject. Y-axis: Std. Achievement; x-axis: Year, 2011-2016. 
Series: Paper, 2015 & 2016; Online, 2015 & 2016; Online 2015, Paper 2016; Paper 2015, Online 2016.] 

Notes: Mean standardized test scores by year for schools administering PARCC in both 2015 and 2016. All test 
scores have been standardized by grade and year within the sample of schools administering the PARCC in 2015 
and 2016. 
Figure 5. Effects of computer-based testing by year relative to implementation 

Notes: Estimates of impulse response function from regression of achievement on current test mode and three leads 
and one lag, school-by-grade and grade-by-year fixed effects, and school characteristics. Estimates are normalized 
relative to the year before implementation. Standard errors clustered at the school level. 

Figure 6. Quantile treatment effects of computer-based testing 

Notes: Estimated quantile treatment effects at each decile of the test score distribution for math and ELA 
achievement. The sample includes all schools in 2015 and 2016 administering the PARCC assessment in Grades 5 
through 8. We estimate quantile treatment effects using the two-step estimator suggested by Firpo (2007), trimming 
the sample to include observations with propensity scores between 0.1 and 0.9. The first-stage propensity score 
includes twice-lagged achievement, student race/ethnicity, free-and-reduced-price-lunch status, special education 
status, limited English proficiency status, grade and year indicators, and each of the means of each of these variables 
at the school-year and school-grade-year levels. Regressions also include mean school achievement for each year 
between 2011 and 2014. Standard errors estimated by clustered bootstrap at the school level with 199 iterations. 
Table 1. Test transition in Massachusetts 

2015 Assessment    2016 Assessment    Number of Schools    Number of Students 
MCAS               MCAS               339                  219,075 
MCAS               PARCC Paper        164                  92,881 
MCAS               PARCC Online       39                   32,692 
PARCC Paper        PARCC Paper        285                  166,508 
PARCC Paper        PARCC Online       26                   18,398 
PARCC Online       PARCC Paper        80                   49,302 
PARCC Online       PARCC Online       267                  157,519 

Notes: Each row shows the number of schools in the analysis sample with the given assessment choice. We include 
the schools in the last four rows, which administered PARCC in both 2015 and 2016, in the analysis sample. 
Table 2. Summary statistics 


2015-2016 2011-2016 
Paper Online Online Paper 
Paper Online Paper Online Paper Online 
(1) (2) (3) (4) (5) (6) 

Math test 0.007 -0.008 -0.029 0.048 0.003 -0.159 
(1.021) (0.977) (1.012) (0.978) (1.027) (0.984) 

ELA test 0.060 -0.063 -0.015 0.045 -0.029 -0.171 
(1.001) (0.995) (1.013) (0.976) (1.022) (0.993) 

Twice-lagged math test -0.045 0.045 -0.055 0.059 0.052 -0.145 
(1.009) (0.989) (1.005) (0.983) (1.025) (0.989) 

Twice-lagged ELA test -0.056 0.057 -0.065 0.075 0.030 -0.143 
(1.009) (0.988) (1.011) (0.978) (1.016) (0.991) 

Male 0.509 0.510 0.509 0.509 0.505 0.514 
(0.500) (0.500) (0.500) (0.500) (0.500) (0.500) 

Asian 0.062 0.060 0.056 0.054 0.077 0.029 
(0.241) (0.237) (0.230) (0.225) (0.267) (0.167) 

African American 0.118 0.082 0.118 0.076 0.119 0.090 
(0.322) (0.275) (0.322) (0.264) (0.324) (0.286) 

Hispanic 0.193 0.147 0.170 0.118 0.222 0.202 
(0.395) (0.354) (0.376) (0.323) (0.416) (0.402) 

Free-lunch eligible 0.405 0.327 0.383 0.283 0.394 0.393 
(0.491) (0.469) (0.486) (0.450) (0.489) (0.488) 

Reduced-price-lunch eligible 0.030 0.038 0.047 0.046 0.044 0.053 
(0.172) (0.190) (0.211) (0.210) (0.205) (0.224) 

Limited English proficient 0.089 0.059 0.082 0.049 0.093 0.081 
(0.285) (0.235) (0.275) (0.215) (0.290) (0.273) 

Special education 0.171 0.163 0.169 0.159 0.170 0.184 
(0.377) (0.369) (0.374) (0.365) (0.375) (0.388) 
N 200,409 191,318 506,234 480,766 149,968 56,960 


Notes: Summary statistics for 2015 and 2016 by school test and test mode choice. 
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Table 3. Estimated test mode effects 


Math ELA 
OLS DID OLS DID 
() (2) (3) (4) (5) (6) 
Online test -0.098*** — -0.100*** — -0.106*** 0.237%" 0.247" 0.236 
(0.012) (0.014) (0.011) (0.012) (0.013) (0.010) 
N 234,397 778,332 1,186,422 235,194 777,874 1,184,736 
Grades 5-8 5-8 3-8 5-8 5-8 3-8 
Years 2015-2016 2011-2016 2011-2016 2015-2016 —-2011-2016 2011-2016 


Notes: Regressions of standardized math (Columns (1)-(3)) and ELA (Columns (4)-(6)) scores on test mode. 
Regressions in Columns (1) and (4) use achievement data from 2014-15 and 2015-16 and include a cubic 
polynomial in twice-lagged achievement, student race/ethnicity, free-and-reduced-price-lunch status, special 
education status, limited English proficiency status, grade-by-year indicators, and the means of each of these 
variables at the school-year and school-grade-year levels. These regressions also include mean school achievement 
for each year between 2011 and 2014. Regressions in Columns (2)-(3) and (5)-(6) use achievement data from 2011-12 
through 2015-16 and replace all school- and individual-level prior achievement variables with school-by-grade 
fixed effects. Regressions in Columns (2) and (5) use Grades 5-8 only; those in Columns (3) and (6) also include 
Grades 3 and 4. Standard errors clustered by school in parentheses. *p < 0.10, **p < 0.05, ***p < 0.01 
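
As a rough illustration of how the significance stars relate to the reported coefficients and clustered standard errors, the sketch below computes two-sided p-values for the math estimates in Columns (1)-(3) of Table 3. It assumes a standard normal approximation to the test statistic; the software used for the paper may instead use a t distribution with cluster-based degrees of freedom, so this is a plausibility check and not the authors' procedure. The function names are hypothetical.

```python
import math


def two_sided_p(coef, se):
    """Two-sided p-value for H0: coef = 0 under a standard normal approximation (an assumption)."""
    z = abs(coef / se)
    # For Z ~ N(0, 1), P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


def stars(p):
    """Map a p-value to the star legend used in the table notes."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""


# (coefficient, clustered standard error) for math, Columns (1)-(3) of Table 3.
math_estimates = [(-0.098, 0.012), (-0.100, 0.014), (-0.106, 0.011)]

for coef, se in math_estimates:
    p = two_sided_p(coef, se)
    print(f"{coef:+.3f} ({se:.3f})  z = {coef / se:6.2f}  {stars(p)}")
```

Each of the three estimates is more than seven standard errors from zero, which is consistent with the three stars shown in the table.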


Table 4. Robustness checks 


              Science (placebo)       Math: Columns 3-5                   ELA: Columns 6-8
              (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)
Online test   -0.005      -0.000      -0.116***   -0.101***   -0.111***   -0.272***   -0.207***   -0.248***
              (0.016)     (0.012)     (0.011)     (0.009)     (0.013)     (0.012)     (0.009)     (0.013)
Model         OLS         DID         DID         DID         Student FE  DID         DID         Student FE
Grades        5,8         5,8         3-8         3-8         3-8         3-8         3-8         3-8
Years         2015-2016   2011-2016   2015-2016   2011-2016   2011-2016   2015-2016   2011-2016   2011-2016
N             116,881     386,246     389,273     1,567,707   1,186,422   389,657     1,565,192   1,184,736


Notes: Regressions of standardized science (Columns (1) and (2)), math (Columns (3)-(5)), and ELA (Columns 
(6)-(8)) scores on school’s PARCC test mode. The regression in Column (1) uses achievement data from 2014-15 and 
2015-16 and includes a cubic polynomial in twice-lagged achievement in math and ELA, student race/ethnicity, 
free-and-reduced-price-lunch status, special education status, limited English proficiency status, grade-by-year 
indicators, and the means of each of these variables at the school-year and school-grade-year levels. The 
regression also includes mean math and ELA school achievement for each year between 2011 and 2014. The 
regression in Column (2) uses achievement data from 2011-12 through 2015-16 and replaces all school- and 
individual-level prior achievement variables with school-by-grade fixed effects. Regressions in Columns (3) and (6) 
use only data from PARCC schools in 2015 and 2016. Regressions in Columns (4) and (7) also add the schools 
adopting the PARCC assessment in 2016 to the sample (schools in Rows 2 and 3 of Table 1). Regressions in 
Columns (5) and (8) use student fixed effects instead of the school-by-grade fixed effects and do not control for any 
individual-level variables. Standard errors clustered by school in parentheses. *p < 0.10, **p < 0.05, ***p < 0.01 
Table 5. Mode effects by year of administration


                      Math                                ELA
                      (1)         (2)         (3)         (4)         (5)         (6)
Online                -0.165***   -0.134***   -0.135***   -0.309***   -0.297***   -0.284***
                      (0.018)     (0.019)     (0.018)     (0.015)     (0.016)     (0.016)
 x 2nd year school    0.059***    -0.011      -0.010      0.109***    0.080***    0.051***
                      (0.010)     (0.022)     (0.015)     (0.011)     (0.019)     (0.014)
 x 2nd year grade                 0.084***                            0.034
                                  (0.024)                             (0.022)
 x 2nd year student                           0.094***                            0.079***
                                              (0.015)                             (0.014)
 x 4th grade          0.039*      0.001       0.000       0.093***    0.078***    0.060***
                      (0.020)     (0.021)     (0.020)     (0.019)     (0.020)     (0.019)
 x 5th grade          0.083***    0.046*      0.045**     0.017       0.002       -0.014
                      (0.022)     (0.024)     (0.023)     (0.019)     (0.021)     (0.019)
 x 6th grade          0.044       0.007       0.008       -0.037      -0.051**    -0.066***
                      (0.027)     (0.027)     (0.027)     (0.023)     (0.023)     (0.022)
 x 7th grade          0.056**     0.018       0.017       0.028       0.013       -0.004
                      (0.027)     (0.027)     (0.027)     (0.026)     (0.027)     (0.026)
 x 8th grade          -0.019      -0.057*     -0.060**    0.069***    0.054**     0.036
                      (0.029)     (0.030)     (0.029)     (0.026)     (0.027)     (0.026)
N                     1,186,422   1,186,422   1,186,422   1,184,736   1,184,736   1,184,736


Notes: Difference-in-differences estimates of mode effects by grade and year of administration. All models include 
the same variables as in Table 3. Standard errors clustered by school in parentheses. *p < 0.10, **p < 0.05, ***p < 0.01 
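
The interaction terms in Table 5 are read by adding them to the main Online coefficient. The sketch below works through that arithmetic for a student taking the online test for the second time in a school in its second year of online administration, holding the grade interaction at its omitted category. It assumes that the "2nd year student" coefficients belong to Columns (3) and (6), and that summing the school and student interactions is the relevant comparison; both are reading assumptions rather than statements from the table notes. Variable and function names are hypothetical.

```python
# Coefficients copied from Table 5; assumed to be Columns (3) (math) and (6) (ELA).
TABLE5 = {
    "math": {"online": -0.135, "second_year_school": -0.010, "second_year_student": 0.094},
    "ela": {"online": -0.284, "second_year_school": 0.051, "second_year_student": 0.079},
}


def second_time_effect(coefs):
    """Implied mode effect for a second-time test taker in a second-year school."""
    return coefs["online"] + coefs["second_year_school"] + coefs["second_year_student"]


def ratio_to_first_year(coefs):
    """Second-time effect as a share of the first-year effect (the Online coefficient)."""
    return second_time_effect(coefs) / coefs["online"]


for subject, coefs in TABLE5.items():
    print(subject, round(second_time_effect(coefs), 3), round(ratio_to_first_year(coefs), 2))
# prints:
# math -0.051 0.38
# ela -0.154 0.54
```

Under these assumptions the implied second-time effects are roughly one third (math) and one half (ELA) of the first-year effects, in line with the magnitudes summarized in the abstract.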
Table 6. Heterogeneous effects


                                        Math        ELA
Full Sample                 Online      -0.106      -0.236
                                        (0.011)     (0.010)
                                        [1.000]     [1.000]
Male                        Online      -0.106      -0.235
                                        (0.011)     (0.011)
                                        [0.714]     [0.562]
Hispanic                    Online      -0.119      -0.250
                                        (0.018)     (0.016)
                                        [0.406]     [0.295]
African American            Online      -0.127      -0.269
                                        (0.020)     (0.023)
                                        [0.265]     [0.090]
Asian                       Online      -0.089      -0.208
                                        (0.022)     (0.019)
                                        [0.335]     [0.087]
Free/Reduced-Price Lunch    Online      -0.117      -0.260
                                        (0.014)     (0.014)
                                        [0.325]     [0.010]
Limited English Proficient  Online      -0.107      -0.284
                                        (0.023)     (0.025)
                                        [0.988]     [0.037]
Special Education           Online      -0.138      -0.288
                                        (0.014)     (0.014)
                                        [0.004]     [0.000]
N                                       234,397     235,194


Notes: Regressions of standardized math and ELA scores on test mode. Regressions 
use the same base sample (Grades 3-8) and specifications as the DID estimates in Table 3. 
Each cell represents a separate regression using the specified subsample. Standard 
errors clustered by school in parentheses. P-values from tests of equality with the 
full-sample coefficient in brackets. *p < 0.10, **p < 0.05, ***p < 0.01 
Appendix Figure 1. Scale scores by year


[Figure: mean scale score (y-axis: Scale Score) by year (x-axis: Year, 2011-2016), shown in two panels; the 
surviving panel title is Math. Series: Paper, 2015 & 2016; Online, 2015 & 2016; Online 2015, Paper 2016; 
Paper 2015, Online 2016.] 


Notes: Mean test scores by year for schools administering PARCC in both 2015 and 2016, measured on the MCAS 
scale. PARCC scale scores in 2015 and 2016 are translated to the 2011-2014 MCAS scale by the state. 